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Abstract. Multivariate statistical analysis is concerned with observations on several 
variables which are thought to possess some degree of inter-dependence. Driven by prob- 
lems in genetics and the social sciences, it first flowered in the earlier half of the last 
century. Subsequently, random matrix theory (RMT) developed, initially within physics, 
and more recently widely in mathematics. While some of the central objects of study in 
RMT are identical to those of multivariate statistics, statistical theory was slow to exploit 
the connection. However, with vast data collection ever more common, data sets now 
often have as many or more variables than the number of individuals observed. In such 
contexts, the techniques and results of RMT have much to offer multivariate statistics. 
The paper reviews some of the progress to date. 
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1. Introduction 

Much current research in statistics, both in statistical theory, and in many areas of 
application, such as genomics, climatology or astronomy, focuses on the problems 
and opportunities posed by availability of large amounts of data. (More detail may 
be found, for example, in the paper by Fan and Li [40] in these proceedings.) There 
might be many variables and/or many observations on each variable. Loosely one 
can think of each variable as an additional dimension, and so many variables cor- 
responds to data sitting in a high dimensional space. Among several mathematical 
themes one could follow ~ Banach space theory, convex geometry, even topology 
- this paper focuses on Random Matrix Theory, and some of its interactions with 
important areas of what in statistics is called "Multivariate Analysis." 



*The author is grateful to Persi Diaconis, Noureddine EI Karoui, Peter Forrester, Matthew 
Harding, Plamen Koev, Debashis Paul, Donald Richards and Craig Tracy for advice and com- 
ments during the writing of this paper, to the Australian National University for hospitality, and 
to NSF DMS 0505303 and NIH ROl EB001988 for financial support. 
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Multivariate analysis deals with observations on more than one variable when 
there is or may be some dependence between the variables. The most basic phe- 
nomenon is that of correlation - the tendency of quantities to vary together: tall 
parents tend to have tall children. From the beginning, there has also been a 
focus on summarizing and interpreting data by reducing dimension, for example 
by methods such as Principal Components Analysis (PCA). While there are many 
methods and corresponding problems of mathematical interest, this paper con- 
centrates largely on PCA as a leading example, together with a few remarks on 
related problems. Other overviews with substantial statistical content include [5], 
[30] and [36]. 

In an effort to define terms and give an example, the earlier sections cover 
introductory material, to set the stage. The more recent work, in the later sections, 

concentrates on results and phenomena which appear in an asymptotic regime in 
which p, the number of variables increases to infinity, in proportion to sample 
size n. 



2 . B ackground 

2.1. Principal Components Analysis. Principal Components Analysis 
(PCA) is a standard technique of multivariate statistics, going back to Karl Pearson 
in 1901 [75] and Harold Hotelling in 1933 [51]. There is a huge literature [63] and 
interesting modern variants continue to appear [80, 87]. A brief description of the 
classical method, an example and references are included here for convenience. 

PCA is usually described first for abstract random variables, and then later as 
an algorithm for observed data. So first suppose we have p variables Xi, . . . ,Xp. 
We think of these as random variables though, initially, little more is asstuned than 
the existence of a covariance matrix E = {cjkk'), composed of the mean-corrected 
second moments 

ffefe' = Cov(Xfe,Xfe/) = £(Xfe - Hk){'>^k' - ^i-k')- 

The goal is to reduce dimensionality by constructing a smaller number of "de- 
rived" variables W = X^fc^fc^fci having variance 

Var(M^) = ^ VkCTkk'Vk' = v'^Sv. 

k,k' 

To concentrate the variation in as few derived variables as possible, one looks for 
vectors that maximize Var(VF). Successive linear combinations are sought that 
are orthogonal to those previously chosen. The principal component eigenvalues 
ij and principal component eigenvectors vj are thus obtained from 

ij = max{v'^Sv : v'^Vj/ = 0; f < j, |v] = I}. (I) 

In statistics, it is common to assume a stochastic model in terms of random 
variables whose distributions contain unknown parameters, which in the present 
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Figure 1. The n data observations are viewed as n points in p dimensional space, the 
p dimensions corresponding to the variables. The sample PC eigenvectors Vj create a 
rotation of the variables into the new derived variables, with most of the variation on 
the low dimension numbers. In this two dimensional picture, we might keep the first 
dimension and discard the second. 



case would be the covariance matrix and its resulting principal components. To 
estimate the unknown parameters of this model we have observed data, assumed 
to be n observations on each of the p variables. The observed data on variable 
is viewed as a vector S M". The vectors of observations on each variable are 
collected as rows into a p x n data matrix 

X = (xki) = [xi . . .Xp]^. 

A standard pre-processing step is to center each variable by subtracting the 
sample mean Xk = Xki, so that Xki ^ Xki — Xk- After this centering, define 

the p X p sample covariance matrix S = {skk') by 

S = (sfcfc') = n^^XX'^, Skk' = n^^ ^ XkiXk'i- 

i 

The derived variables in the sample, w = Xw = Vk'X-k, have sample variance 
Var(w) — v'^S'v. Maximising this quadratic form leads to successive sample prin- 
cipal components and from the sample analog of (1): 

Ij = maxjv^S-v : v^v^v = 0, j' < j, |v| 1} 
Equivalently, we obtain for j — 1, . . . ,p, 

Note the statistical convention: estimators derived from samples are shown with 
hats. Figure 1 shows a conventional picture illustrating PCA. 

Observed data are typically noisy, variable, and limited in quantity, so we are 
interested in the estimation errors 



i,{X)^£,, v,(X)-v,. 
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An additional key question in practice is: how many dimensions are "significant", 
or should be retained? One standard approach is to look at the percent of total 
variance explained by each of the principal components: 

An example. Menozzi, Piazza, and Cavalli-Sforza [70] is a celebrated exam- 
ple of the use of PCA in human genetics and anthropology. It was known from 
archaeological excavations that farming spread gradually from Near East across 
Europe 9000-5000 yrs ago (map below right). A motivating question was whether 
this represented spreading of the farmers themselves (and hence their genes) or 
transfer of technology to pre-existing populations (without a transfer of genes). 

Menozzi et al. [70] brought genetic data to bear on the issue. Simplifying 
considerably, the data matrix X consisted of observations on the frequencies of 
alleles of p = 38 genes in human populations at n = 400 locations in Europe. 
The authors sought to combine information from the 38 genes to arrive at a low 
dimensional summary. 

A special feature of the genetics data is that the observations i have associated 
locations \oc[i], so that it is possible to create a map from each of the principal 
components Wj , by making a contour plot of the values of the derived variable wj [i] 
at each of the sampling locations loc[z]. For example the first principal component 
(map below left) shows a clear trend from south-east to north-west, from Asia 
Minor to Britain and Scandinavia. The remarkable similarity of the PC map, 
derived from the gene frequencies, with the farming map, derived from archaeology, 
has been taken as strong support for the spread of the farmers themselves. 

For the genetics data, the first component (out of 38) explains pi = 27% of 
the variance, the second p2 = 18%, and the third p^ = 11%. Thus, and this is 
typical, more than half the variation is captured in the first three PCs. The second 
and third, and even subsequent PCs also show patterns with important linguistic 
and migratory interpretations. For more detail, we refer to books of Cavalli-Sforza 
[22, 23], from which the maps below are reproduced. 




'Be^^i^^^ lll^spttm. ,{|||iHgHi^ Hl^soW' 
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2.2. Gaussian & Wishart Distributions. For quantitative analysis, 
we need more specific assumptions about the process generating the data. The sim- 
plest and most conventional model assumes that the p random variables Xi, . . . , Xp 

follow a p— variatc Gaussian distribution A'p(/i, S), with mean /i and covariance 
matrix S, and with probability density function for X — (Xi, . . . ,Xp) given by 

/(X) = |x/2^S|-V2 exp{- 1 (X - m)^S-1(X - /x)}. 

The observed sample is assumed to consist of n independent draws Xi, . . . , X„ 

from X ^ Np{fi, E), collected into a. p x n data matrix X = [Xi . . . X„]. When 
focusing on covariances, it is a slight simplification to assume that /z = 0, as we 
shall here. In practice, this idealized model of independent draws from a Gaussian 
is generally at best approximately true - but we may find some reassurance in the 
dictum "All models are wrong, some are useful." [16] 

The (un-normalized) cross product matrix A = XX^ is said to have a p - 
variate Wishart distribution on n degrees of freedom. The distribution is named 
for John Wishart who in 1928 [97] derived the density function 

f{A) = c„,p|E|-"/2|A|(»-f-i)/2exp{-itr(E-i^)}, 

which is supported on the cone of non-negative definite matrices. Here c„_p is a 
normalizing constant, and it is assumed that E is positive definite and that n > p. 

The eigendecomposition of the Wishart matrix connects directly with Principal 
Components Analysis. Start with a Gaussian data matrix, form the covariance S, 
yielding a Wishart density for A = nS. The eigenvalues and vectors of A, given 

by 

Aui = liUi, li>...>lp>0, (2) 
are related to the principal component eigenvalues and vectors via 

li — Tl^i , Ui — . 

Canonical Correlations. We digress briefly from the PGA theme to mention 
one additional multivariate technique, also due to Hotelling [52], since it will help 
indicate the scope of the results. Given two sets of variables X = (Xi, . . . , Xp) and 
Y = (Yi, . . . , Yq), with a joint p + (/-variate Gaussian distribution, we may ask for 
that linear combination of X that is most correlated with some linear combination 
of Y, seeking the canonical correlations 

r? = maxGorr {ufX, vjY), 

Ui,Vi 

and the maximization is subject to \ui\ = \vi\ = 1. 

To take an example from climatology [8] : the X variables might be sea surface 
temperatures at various ocean locations, and the Y variables might be land temper- 
atures at various North American sites. The goal may be to find the combination 
of sea temperatures that is most tightly correlated with some combination of land 
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temperatures. For a recent example in functional magnetic resonance imaging, see 
[44]. 

If we have n draws (Xj, Yi), i = 1, . . . ,n from the joint distribution, the sample 

version of this problem may be written as a generalized cigcncquation that involves 
two independent matrices A and B, each following p—variate Wishart distributions 
- on g and n — q degrees of freedom respectively: 

Avj=r]{A + B)v,, r\>...>rl. 

The parameters of the Wishart distribution depend on those of the parent 
Gaussian distribution of the data - if X and Y are independent, then they both 
reduce to Wishart matrices with identity covariance matrix: A ~ /) and 

The Double Wishart setting. Suppose we have two independent Wishart ma- 
trices A Wp(ni,/) and B ^ Wp{n2,I), with the degrees of freedom parameters 
ni,n2 > P- We call this the double Wishart setting. Two remarks: By writing 
Wishart distributions with identity matrices, we emphasize, for now, the "null hy- 
pothesis" situation in which there is no assumed structure (compare Section 4). 
Second, by taking a limit with n2 — > oo, one recovers the single Wishart setting. 

Of central interest are the roots Xi,i = 1, ... ,p of the generalized eigenproblem 
constructed from A and B: 

det[x{A + B)-A]=0. (3) 

The canonical correlations problem is a leading example. In addition, essen- 
tially all of the classical multivariate techniques involve an cigcndecomposition 
that reduces to some form of this equation. Indeed, we may collect almost all the 
chapter titles in any classical multivariate statistics textbook (e.g. [3, 58, 68, 72]) 
into a table: 

Double Wishart Single Wishart 

Canonical correlation analysis Principal Component analysis 

Multivariate Analysis of Variance Factor analysis 

Multivariate regression analysis Multidimensional scaling 

Discriminant analysis 
Tests of equality of covariance matrices 



This table emphasizes the importance of finding the distribution of the roots 
of (3), which are basic to the use of these methods in applications. 

Joint density of the eigenvalues. The joint null hypothesis distribution of 
the eigenvalues for canonical correlations and principal components was found in 
1939. The results were more or less simultaneously obtained by five distinguished 
statisticians in three continents [41, 45, 54, 71, 81]: 

f{xi,...,Xp) = cY[w^^'^{xi)'[[{xi-Xj), xi>...>Xp, (4) 

i i<j 
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with 



.n—p—lp—x 



single Wishart 
double Wishart. 



iu{x) = 



a;"i-P-i(l-a;) 



"2-p-l 



The normalizing constant c is given, using the multivariate Gamma function rp{a) = 



Thus, the density has a product term involving each of the roots one at a time, 
through a weight function w which one recognizes as the weight function for two of 
the classical families of orthogonal polynomials, Laguerre and Jacobi respectively. 

The second product is the so-called "Jacobian" term, which arises in the change 
of variables to eigenvalue and eigenvector co-ordinates. It is this pairwisc interac- 
tion term, also recognizable as a Vandermonde determinant (see (13) below), that 
causes difficulty in the distribution theory. 

This result was the beginning of a rich era of multivariate distribution theory 
in India, Britain, the U.S., and Australia, summarized, for example, in [3, 68, 72]. 
While some of this theory became so complicated that it lost much influence on 
statistical practice, with new computational tools and theoretical perspectives the 
situation may change. 

2.3. Random Matrices. We detour around this theory and digress a mo- 
ment to introduce the role of random matrix theory. Beginning in the 1950s, 
physicists began to use random matrix models to study quantum phenomena. In 
quantum mechanics the energy levels of a system, such as the nucleus of a complex 
atom, are described by the eigenvalues of a Hermitian operator H, the Hamilto- 
nian: Hipi = Eiipi, with Eq < Ei < The low-lying energy levels can be 

understood by theoretical work, but at higher energy levels, for example in the 
millions, the analysis becomes too complicated. 

Wigner proposed taking the opposite approach, and sought a purely statistical 
description of an "ensemble" of energy levels - that could yield properties such as 
their empirical distribution and the distribution of spacings. He further made the 
hypothesis that the local statistical behavior of energy levels (or eigenvalues) is well 
modeled by that of the eigenvalues of a random matrix. Thus the approximation 
is to replace the Hermitian operator iJ by a large finite random N x N matrix 
Hn. 

One example of a statistical description that we will return to later is the 
celebrated SemiCircle Law [95, 96]. This refers to the eigenvalues of a so-called 
Wigner matrix Hn, with independent and identically distributed entries of mean 
and a finite variance u^. With no further conditions on the distribution of 
the matrix entries, the empirical distribution Fjv(i) = #{i : Xi < t}/N of the 
eigenvalues converges to a limit with density given by a semicircle: 



TT' 



■p(p-i)/4nr=ir(a 



{i - l)/2), by 




dFN{xaVN) —V4:-x^dx. 
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Ensembles and Orthogonal Polynomials Quite early on, there was 
interest in eigenvalue distributions whose densities could be described by more 
general families of weight functions than the Gaussian. For example, Fox and Kahn 
[43] used weight functions from the families of classical orthogonal polynomials. 
Analogies with statistical mechanics made it natural to introduce an additional 
(inverse temperature) parameter /3, so that the eigenvalue density takes the form 

JV 

/(a;i, . . . ,a;jv) = c JJ«;(a;j)^/^ |a;i -Xjf. (5) 

1 i<j 

At this time, it was only partially realized that in the case /? = 1, these densities 
were already known in statistics. But the table shows that in fact, the three 
classical orthogonal polynomial weight functions correspond to the three most 
basic null eigenvalue distributions in multivariate statistics: 

w{x) = e~^^/^ Hermite Hi- Gaussian 

x°'e~^ Lagucrre Wishart 

{1- x)''{l + x)'> Jacobi P^'*" Double WishaH 

Table 1. The orthogonal polynomials are taken in the standard forms given in Szego [86]. 

Dyson [34] showed that physically reasonable symmetry assumptions restricted 
/? to one of three values: 

Symmetry Type Matrix entries 
(3 = 1 orthogonal real 
(3 = 2 unitary complex 
/3 = 4 symplectic quaternion 

Mathematically, the complex-valued case is always the easiest to deal with, but of 
course it is the real case that is of primary statistical (and physical) interest; though 
cases with complex data do occur in applications, notably in communications. 

To summarize, the classical "null hypothesis" distributions in multivariate 
statistics correspond to the italicized eigenvalue densities in the 

Gaussian \ ( Orthogonal \ 
Laguerre > < Unitary > Ensemble. 
Jacobi J [ Symplectic J 

These arc often abbreviated to LOE, JUE, etc. We have not italicized the Sym- 
plectic case for lack (so far) of motivating statistical applications (though see [4]). 

Some uses of RMT in Statistics This table organizes some of the classical 
topics within RMT, and some of their uses in statistics and allied fields. This paper 
will focus selectively (topics in italics), and in particular on largest eigenvalue 
results and their use for an important class of hypothesis tests, where RMT brings 
something quite new in the approximations. 
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Bulk Graphical methods [92, 93] [finance [15, 77], communications [91]] 

Linear Statistics Hypothesis tests, distribution theory 

Extremes Hypothesis tests, distribution theory, role in proofs [21, 33] 

Spacings [[10], otherwise few so far] 

General Computational tools [65], role in proofs 

2.4. Asymptotic regimes. Types of Asymptotics The coincidence of 

ensembles between RMT and statistical theory is striking, but what can it be used 
for? The complexity of finite sample size distributions makes the use of asymptotic 
approximations appealing, and here an interesting dichotomy emerges. Traditional 
statistical approximations kept the number of variables p fixed while letting the 
sample size n ^ oo. This was in keeping with the needs of the times when the 
number of variables was usually small to moderate. 

On the other hand, the nuclear physics models were developed precisely for 
settings of high energy levels, and so the number of variables in the matrix models 
were large, as seen in the Wigner semi-circle limit. Interestingly, the many- variables 
limit of RMT is just what is needed for modern statistical theories with many 
variables. 

Stat: CWishart RMT: Laguerre UE 
Density 11^=1 a;""^e-^^A(a;) n^Li a^^e'^^ A(a;) 

^ variables: p N 

Sample size: n — p a 

Comparison of the parameters in the statistics and RMT versions of the Wishart 
density in the table above leads to an additional important remark: in statistics, 
there is no necessary relationship between sample size n and number of variables 
p. We will consider below limits in which p/n ^ j £ (0, oo), so that 7 could take 
any positive value. In contrast, the most natural asymptotics in the RMT model 
would take N large and a fixed. Thus, from the perspective of orthogonal polyno- 
mial theory, the statistics models lead to somewhat less usual Plancherel-Rotach 
asymptotics in which both parameters N and a of the Laguerre polynomials are 
large. 

Spreading of Sample Eigenvalues To make matters more concrete, we first 

describe this phenomenon by example. Consider n = 10 observations on a p = 10 
variable Gaussian distribution with identity covariance. The sample covariance 
matrix follows a Wishart density with n = p = 10, and the population eigenvalues 
£j{I) are all equal to 1. 

Nevertheless, there is an extreme spread in the sample eigenvalues £j = £j{S), 
indeed in a typical sample 

{ij) = (.003, .036, .095, .16, .30, .51, .78, 1.12, 1.40, 3.07) 

and the variation is over three orders of magnitude! Without some supporting 
theory, one might be tempted to (erroneously) conclude from the sample that the 
population eigenvalues are quite different from one another. 
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This spread of sample eigenvalues has long been known, indeed it is an example 
of the replusion of eigenvalues induced by the Vandermonde term in (4). It also 
complicates the estimation of population covariance matrices - also a long standing 
problem, discussed for example in [27, 47, 66, 85, 98]. 

The Quarter Circle Law Marcenko and Pastur [69] gave a systematic 
description of the spreading phenomenon: it is the version of the semi-circle law 
that applies to sample covariance matrices. We consider only the special case in 
which A Wp{n,I). The empirical distribution function (or empirical spectrum) 
counts how many sample eigenvalues fall below a given value t: 

Gp{t)=p-'i^{ij<t}. 

The empirical distribution has a limiting density g'^^ if sample size n and number 
of variables p grow together: p/n — > 7: 

9^'it) = ^^^^^^, b, = ii±v^r. 

The larger p is relative to n, the more spread out is the limiting density. In 
particular, with p = n/4, one gets the curve supported in [j, |]. For p = n, the 

extreme situation discussed above, the curve covers the full range from to 4, 
which corresponds to the huge condition numbers seen in the sample. 




Figure 2. Marcenko-Pastur limit density for 7 = | and 7 = 1. 



3. Largest Eigenvalue Laws 

Hypothesis Test for Largest Eigenvalue Suppose that in a sample of 
n = 10 observations from a p = 10 variate Gaussian distribution A''io(0, S), we 
see a largest sample eigenvalue of 4.25. Is the observed value consistent with an 
identity covariance matrix (with all population eigenvalues = 1), even though 4.25 
lies outside the support interval [0, 4] in the quarter-circle law? 

In statistical terms, we are testing a null hypothesis of identity covariance ma- 
trix, Hq •.T, = I against an alternative hypothesis Ha : E 7^ J that E has some more 
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general value. Normally, of course, one prefers the simpler model as a description 
of the data, unless forced by evidence to conclude otherwise. 

One might compare 4.25 to random samples of the largest eigenvalue from the 
null hypothesis distribution (three examples yielding 2.91, 3.40 and 3.50); but what 
is actually needed is an approximation to the null hypothesis distribution of the 
largest sample eigenvalue: 

P{ii>t I Ho = Wp{n,I)}. 

Tracy- Widom Limits Random matrix theory leads to the approximate 
distribution we need. In the single Wishart case, assume that A ^ Wp{n, I), either 

real or complex, that p/n — > 7 G (0,oo) and that £1 is the largest eigenvalue in 
equation (2). For the double Wishart case, assume that A ~ Wp{ni,I) is indepen- 
dent of B ~ Wp{n2,I), either real or complex together, and that {p/ni,p/n2) — > 
(71j72) G (0, 1)^, and that £1 is the largest generalized eigenvalue in equation (3). 
With appropriate centering jjnp and scaling anp detailed below, the distribution of 
the largest eigenvalue approaches one of the Tracy- Widom laws: 

P{n£i < Unp + a„ps\Ho} Ff3{s). (6) 

These laws were first found by Craig Tracy and Harold Widom [88, 89] in 
the setting of the Gaussian unitary and orthogonal ensembles, i.e. (Hermitian) 
symmetric Gaussian matrices with i.i.d. entries. There are elegant formulas for 
the distribution functions 

F2 (s) = exp - ^ {x- sfq{x)dx^ , Fi {sf = F^ (s) exp - ^ q{x)dx^ . 

in terms of the solution q to classical (Painleve II) non-linear second-order differ- 
ential equation 

q" = sq + 2q^, q{s) ^ Ai(s) as s — > 00. 

While q and fjg are somewhat tricky to compute numerically^, from the point of 
view of applied data analysis with a software package, it is a special function just 
like the normal curve. 

As will be seen from the explicit formulas (8)- (12) below, the scale of fluctuation 
(^np/i^np of the largest eigenvalue is 0(n~^/^) rather than the 0(n~^/^) seen in the 
Gaussian domain of attraction. This reflects the constraining effect of eigenvalue 
repulsion due to the Vandermonde term in (4). 

The fact that the same limit arises in the single and double Wishart settings 
(Laguerre, Jacobi ensembles) is an instance of the universality discussed in P. 
Deift's paper [29] in this volume. In a different direction, one can modiiy the 

^At time of writing, for available software in MATLAB see 

http://math.arizona.edu/ momar/research.htm and [31] in S-PLUS see 

http://www.vitrum.md/andrew/MScWrwck/codes.txt and [9]. Both are based on ideas of 
[76] [see also [35]] 
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assumption that the i.i.d. entries in the p x n data matrix X arc Gaussian. Sosh- 
nikov [82] shows that ifn—p = 0{p^^^) and the matrix entries have sufficiently 
light (subGaussian) tails, then the largest eigenvalue continues to have a limiting 
Tracy- Widom distribution. The behavior of the largest eigenvalues changes radi- 
cally with heavy tailed Xij - for Cauchy distributed entries, after scaling by rt^p^, 
[83, 84] shows a weak form of convergence to a Poisson process. If the density of 
the matrix entries behaves like jxj"^, then [13] give physical arguments to support 
a phase transition from Tracy- Widom to Poisson at /x = 4. 



Second-order accuracy To demonstrate the relevance of this limiting 
result for statistical application, it is important to investigate its accuracy when 
the parameters p and n are not so large. The generic rate of convergence of the left 
side of (6) to F/3{s) is 0{p~^^^). However, small modifications in the centering and 
scaling constants /U and a, detailed in the four specific cases below, lead to 0(p~^/^) 
errors, which one might call "second-order accuracy" . With this improvement, (6) 
takes the form 

\P{nii < iinp + <Jnp8\Ho} - Fp{s)\ < Ce-'^'p-'^l^. (7) 

This higher-order accuracy is reminiscent of that of the central limit, or normal, 

approximation to the t— test of elementary statistics for the testing of hypotheses 
about means, which occurs when the underlying data has a Gaussian distribution. 

Single Wishart, Complex Data. Convergence in the form (6) was first estab- 
lished by Johansson [57] as a byproduct of a remarkable analysis of a random 
growth model, with 

/I 1 \^/^ 

The second-order result (7) is due to El Karoui [38], If n'^p and a!^p denote the 
quantities in (8) with n and p replaced by n + 1/2 and p + 1/2, then the centering 
finp is a weighted combination of jJ-n-i^p and l^'n^p-i and the scaling (7„p a similar 
combination of cr^_i^p and cr'n^p-i- 

Single Wishart, Real Data. Convergence without rates in the form (6) to 
Fi{s) with centering and scaling as in (8) is given in [60]. The assumption that 
p/n J £ (0,oo) can be weakened to min{n,p} oo, as shown by El Karoui 
[37] - this extension is of considerable statistical importance since in many settings 
p^ n (see for example [40] in these proceedings). 

Analysis along the lines of [61] suggests that the second order result (7) will 
hold with 



l^np 



^np 



-i + v5^) (9) 
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Double Wishart, Complex Data. Set «; = ni + n2 + 1 and define 



sin 



sm 



(i) 



Then 



sm 



(11) 



(12) 



4k;^ sin (j) sin 7 

The second-order result (7) (currently without the exponential bound, i.e., with 
c = 0) is established in [61] with a weighted combination of /Lt° and and 
the scaling Unp a similar combination of cr° and <J°_i. 

Double Wishart, Real Data. Bound (7) is shown in [61] (again still for c = 0) 
with jjbnp and cr„p given by (12) with k = ni + n2 — 1. 

Approximation vs. Tables for p = 5 With second-order correction, 

Tracy- Widom approximation turns out to be surprisingly accurate. William Chen 
[24, 25, 26] has computed tables of the exact distribution in the double Wishart, 
real data, case that cover a wide range of the three parameters p, n\ and n2, and 
allow a comparison with the asymptotic approximation. Even for p = b variables, 
the TW approximation is quite good. Figure 3, across the entire range of ni and 
n2. 



Table vs. Approx at 95th %tlle: mc ■ (q-p-1 )/2; nc » (n-q-p-1 )/2 




Figure 3. A comparison of the 95th percentile, relevant for hypothesis tests, from Chen's 
table (dashed line) and the Tracy- Widom approximation (solid line). Chen's parameters 
mc, Tic are related to our double Wishart paramaters ni, n2 by m-c = (ni —p — l)/2, ric = 
(n2-p-l)/2. 



A different domain of attraction The Tracy- Widom laws are quite differ- 
ent from other distributions in the standard statistical library. A full probabilistic 
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understanding of their origin is still awaited (but sec [78] for a recent character- 
ization in terms of the low lying eigenvalues of a random operator of stochastic 
diffusion type). Instead, we offer some incomplete remarks as prelude to the orig- 
inal papers [88, 89]. 

Since one is looking at the largest of many eigenvalues, one might be re- 
minded of extreme value theory, which studies the behavior of the largest of a 
collection of variables, which in the simplest case arc independent. However, ex- 
treme value theory exploits the independence to study the maximum via products: 
{maxi<i<p k < t} = HiLi I{h < *} For eigenvalues, however, the Jacobian term, 
or Vandermonde determinant, 

V{1) = - k) = det[/f-i]i<i,fc<p, (13) 
changes everything. The theory uses the inclusion-exclusion relation: 

f{I{k<t}=Y^{-lf(''^i\I{k>t}. 

j=l fe=0 ^ ^ i=l 

The product structure of the left side, central to extreme value theory, is discarded 
in favor of the right side, which leads to an expression for P{maxi<j<j, Zj < t) in 
terms of so-called Fredholm determinants. 

For example, it is shown by Tracy and Widom [90] that for complex data 

P{max/i <t} = det(/ - KpX(t.,oo)), 

where xi is the indicator function for interval /, and Kp : ^ L2 is an operator 
whose kernel is the two-point correlation function 

p 

Kp{x,y) = ^(t)k{x)(t>k{y), 
fe=i 

written in terms of weighted orthonormal polynomials (l)k = h^^^^w^^'^Pk, where 
the polynomials pk and weight functions iv are given in Table 1 for the Gaussian, 
Wishart and double Wishart settings respectively. 
For real data, Tracy and Widom [90] show that 



P{maxZj <t} = ydet{I - K.pX{t,oc:)), 

where /Cp is now a 2 x 2 matrix- valued operator on L2 ^ L2. The corresponding 
kernel has form 



where Kp = Kp + ri and ri is a rank one kernel described in the three cases in more 
detail in [1, 42, 61]. Here D2 and ei denote partial differentiation and integration 
with respect to second and first variables respectively. 
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The expressions are thus somewhat more complicated in the real data case of 
primary interest in statistics. However they are amenable to analysis and approx- 
imation using orthogonal polynomial asymptotics near the largest zero, and to 
analysis based on the error terms to get the higher order approximation. 

Back to the Example We asked if an observed largest eigenvalue of 4.25 was 

consistent with Hq : S = / when n ~ p = 10. The Tracy- Widom approximation 
using moments (9)-(10) yields a 6% chance of seeing a value more extreme than 
4.25 even if "no structure" is present. Against the traditional 5% benchmark, this 
would not be strong enough evidence to discount the null hypothesis. 

This immediately raises a question about the power of the largest root test, 
namely evaluation of 

P{ii>t\ Wp{n,T,)} 

when S 7^ J. How different from 1 does Am£ix(S) need to be before Hq is likely to 
be rejected? To this we now turn. 



4. Beyond the Null Hypothesis 

From the perspective of multivariate distribution theory, we have, in a sense, barely 
scratched the surface with the classical RMT ensembles, since they correspond to 
symmetric situations with no structure in the population eigenvalues or covariance 
matrix. Basic statistical quantities like power of tests and confidence intervals, 
as well as common applications in signal processing, genetics or finance, call for 
distributions under structured, asymmetric values for the covariance matrix S. 

Statistical theory (pioneered by Alan James [56, e.g.], and summarized in the 
classic book by Robb Muirhead [72]) gives expressions for the classical multivari- 
ate eigenvalue distributions in more general settings, typically in terms of hyper- 
geometric functions of matrix argument. For example, if L = diag(Zi) are the 
eigenvalues of A Wp{n, E), then the joint eigenvalue density 

^^f,"- = |S|-"/2exp{itrL}oFo(-iS-\i), 
j/^ti, . . . , ip) 

with 

oFo{S,T)= I exp{tr{SUTU^)}dU, (14) 
Jo{p) 

and dU normalized Haar measure, but many other versions occur in the general 
theory. Despite recent major advances in computation by Alan Edelman and Pla- 
men Koev [64, 65], and considerable work on the use of Laplace approximations 
(see e.g. [19, 20]), statistical theory would benefit from further serviceable approx- 
imations to these typically rather intractable objects. 



Persistence of the Tracy-Widom Limit One basic question asks, in the 
setting of Principal Components Analysis, for what conditions on the covariance 
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S does the Tracy- Widom approximation continue to hold, 

P{ii < Mnp(S) + a„p(S)s} ^ F0{s), (15) 

perhaps with modified values for centering and scaling to reflect the value of E? 

Fascinating answers are beginning to emerge. For example, El Karoui [39] 
establishes that (15) holds, along with explicit formulas for /x„p(S) and fT„p(E), if 
enough eigenvalues accumulate near the largest eigenvalue, or if a small number of 
eigenvalues are not too isolated, as we describe below in a specific setting below. 

Some of the results are currently restricted to complex data, because they build 
in a crucial way on the determinantal representation of the unitary matrix integral 
(the complex analog of (14)) 

/ exp{trE-i[/L£/*}rft/ = c ^^l.^' (16) 

known as the Harish-Chandra-ltzykson-Zuber formula [50, 55], see also [46]. Here 
the eigenvalues of arc given by diag(7rj) and V(l) is the Vandcrmonde deter- 
minant (13). While it is thought unlikely that there are direct analogs of (16), 
we very much need extensions of the distributional results to real data: there are 
some results in the physics literature [17], but any statistical consequences are still 
unclear. 



Finite rank perturbations. We focus on a simple concrete model, and 
describe a phase transition phenomenon. Assume that 

^ = diag{h,...,eM,(Tl...,al), (17) 

so that a fixed number M of population eigenvalues are greater than the base level 
CTg, while both dimensions p and n increase in constant ratio p/n ^ 7 G (0, 00). 

First some heuristics: if all population eigenvalues are equal, then the largest 
sample eigenvalue ii has n~^/^ fluctuations around the upper limit of the support 
of the Marcenko-Pastur quarter circle law, the fluctuations being described by the 
Tracy- Widom law. For simplicity, consider M = 1 and = 1. If £1 is large 
and so very clearly separated from the bulk distribution, then one expects Gaus- 
sian fluctuations of order n~^/^, and this is confirmed by standard perturbation 
analysis. 

Baik et al. [7] describe, for complex data, a 'phase transition' that occurs be- 
tween these two extremes. If ^1 < 1 -|- ^/7, then 



n2/3(ii - ^)/a 




4 = 1 + x/7 



where, from (8), we may set 

M=(l+V7)^ a = (l+v7)(l + ^)^/^ 
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and F2 is related to F2 as described in Balk et al. [7]. On the other hand, if 

^1 > 1 + V7, 

n^/^ih - t,{h))/a{h) ^ N{0,1), 

with 

M^O = ^i(i + ^), -^(^0 = ^?(i-(^). (18) 

Thus, below the phase transition the distribution of ii is unchanged, Tracy- 
Widom, regardless of the value of £1. As £1 increases through 1 + y^, the law of £1 
jumps to Gaussian and the mean increases with £1, but is biased low, < ii, 

while the variance cr'^{£i) is lower than its value, i^, in the limit with p fixed. 

A key feature is that the phase transition point 1 + located at the zero 
of cr(fi), is buried deep inside the bulk, whose upper limit is (1 + y^)^. A good 
heuristic explanation for this location is still lacking, though see El Karoui [39]. 

Further results on almost sure and Gaussian limits for both real and complex 
data, and under weaker distributional assumptions have been obtained by Paul 
[74] and Baik and Silverstein [6]. 

A recent example. Harding [49] illustrates simply this phase transition 
phenomenon in a setting from economics and finance. In a way this is a negative 
example for PGA; but statistical theory is as concerned with describing the limits 
of techniques as their successes. 

Factor analysis models, of recently renewed interest in economics, attempt to 
"explain" the prices or returns of a portfolio of securities in terms of a small 
number of common "factors" combined with security-specific noise terms. It has 
been further postulated that one could estimate the number and sizes of these 
factors using PGA. In a 1989 paper that is widely cited and taught in economics 
and finance. Brown [18] gave a realistic simulation example that challenged this 
view, in a way that remained incompletely understood until recently. 

Brown's example postulated four independent factors, with the parameters of 
the model calibrated to historical data from the New York Stock Exchange. The 
return in period t of security k is assumed to be given by 

Rkt^K=ibk.Ut + eku k = l,...,p; t = l,...,T, (19) 

where it is assumed that bkv ~ N{13,(jI), f„t ~ -^(0,crj) and e^t ~ -^(0,o'e)) all 
independently of one another. The population covariance matrix has the form (17) 
with M = 4 and 

Ij = pa) {al + 4/3Sji ) + al j = 1 , . . . , 4. (20) 

Here Sji is the Kronecker delta, equal to 1 for j = 1 and otherwise. Figure 4(a) 
plots the population eigenvalues £1 (the dominant 'market' factor), the common 
value £2 = ^3 = ^4 and the base value £5 = a^ against p, the number of securities 
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in the portfolio. One might expect to be able to recover an estimate of ^2 from 
empirical data, but this turns out to be impossible for p S [50, 200] when T = 80 
as shown in Figure 4(b). First, the range of observed values of the top or market 
eigenvalue is biased upward from the true top eigenvalue. In addition, there are 
many sample eigenvalues above the anticipated value for £2- 




Figure 4. Population and sample eigenvalues for a four factor model (19) with /3 = 
0.6, at = .4, (J/ = .01257, = .0671. [Brown & Harding use fi — 1, at = .1; the values are 
modified here for legibility of the plot.] (a) Left panel: Population eigenvalues according 
to (20) (b) Right panel: The top sample eigenvalue in replications spreads about a sample 
average line which tracks the solid line given by (18), in particular overestimating the 
population value £1. The next nine sample eigenvalues fall at or below the Marcenko- 
Pastur upper limit, swamping the next three population eigenvalues. 

Harding shows that one can directly apply the (real version) of the phase tran- 
sition results previously discussed to fully explain Brown's results. Indeed, the 
inability to identify factors is because they fall on the wrong side of the phase 
transition (7^(1 + y/p/T), and so we can not expect the observed eigenvalue esti- 
mates to exceed the Marcenko-Pastur upper bound ct'^{1 + y^p/T)"^. Finally, the 
bias between the observed and true values of the top eigenvalue is also accurately 
predicted by the random matrix formulas (18). 



5. Estimating Eigenvectors 

Most of the literature at the intersection of random matrix theory and statis- 
tics is focused on eigenvalues. We close with a few remarks on the estimation of 
eigenvectors. Of course, the question is only meaningful in non-symmetric set- 
tings when the covariance matrix S is not proportional to /. We again assume 
that S ~ VFp(n, S) and now focus attention on covariance models which are a 
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finite-rank perturbation of the identity : 

M 



a2/ + ^A,Mj, (21) 



v=l 

with Ai > . . . > Am > and {9,j} orthonormal. We ask how well can the popula- 
tion eigenvectors 9,^ be estimated when both p and n are large. 

First some remarks on how model (21) can arise from an orthogonal factor or 
variance components model for the data. Assume that the p— dimensional obser- 
vations Xi,i = 1, . . . , n have the form 

M 



i/=i 

where {v^i : 1 < i' < M} are i.i.d. A''(0, 1), independently of Zi ~ Np{0,lp), for 
all i. If we further assume, for convenience, that fi = 0, then with the sample 
covariance S defined as in Section 2, then S ~ Wp{n, E). If we express Xj, 9i, and 
Zi in (22) in terms of co-ordinates in a suitable basis {e^. A; = 1, . . . ,p} and write 
fvi = ^J\jVvi we obtain 

M 

Xki = 9kiffvi + O'Zki, 

in which 9kv is viewed as the factor loading of the A;th variable on the i^th factor, 
and fvi is the factor score of the v\h. factor for the ith individual. As we have seen 
in (19) in the previous section, in economics X^i may represent the return on the 
fcth security in time period i. 

Assume that Ai > ... > Am > 0. Let 9^ denote the normalized sample 
eigenvectors of S (denoted in Section 2.1) associated with the M largest sample 
eigenvalues. In classical asymptotics, with n large and p fixed, there is a well 
understood Gaussian limit theory: 

^fTi(9v-9v)^Np{SS,Vv) (22) 

where is given, for example, in [2, 3]. 

The situation is radically different when p/n ^ 7 > - indeed, ordinary PCA 
is necessarily inconsistent: 



(^i/, 9v) 




K e [0, ^] 



For signal strengths A below the phase transition just discussed, nothing can be 
estimated - the estimate is asymptotically orthogonal to the truth. The angle 

decreases as A^ grows, but is never exactly consistent. 



■^The situation is diflferent in functional Principal Components Analysis, where smoothness of 
the observed data (functions) leads to covariance matrices with smoothly decaying eigenvalues. 
For entries into this literature, see for example [14, 28, 48] 
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This result has emerged in several literatures, starting in the learning the- 
ory/statistical physics community, with non-rigorous arguments based on the replica 
method [53, 79], where this phenomenon has been termed "retarded learning" 
[11, 94]. More recently, rigorous results have been obtained [62, 73, 74]. 

To obtain consistent estimates, further assumptions are needed. One plausible 
situation is that in which there exists a basis {ek}k=i:p in which it is believed that 
the vectors 0,^ have a sparse representation. In microarray genetics, for example 
Xki might be the expression of gene k in the zth patient, and it may be believed 
that (in the standard basis) each factor u is related to only a small number of 
genes [67]. In EEG studies of the heart, the beat-to-beat cycle might be expressed 
in a wavelet basis, in which the components of variation 9i, may well be sparsely 
represented [62]. 

We briefly describe results in the sparse setting of work in progress by D. Paul, 
and by Paul and the author. For simplicity only, we specialize to M = 1. The 
error of estimation, or loss, of 6 is measured on unit vectors by 

L{0,e) = 11^ - sign((^, 6i))6'f =4sin2 1/(^,6*). 
If 9 is now the ordinary PCA estimate of 9, and if p/n ^ 7 > 0, then to first order, 

from which it is natural to define the "per- variable" noise level t„ = 1/ yJnh{X). 

As is common in non-parametric estimation theory, we use Iq norm, g < 2, as 
a measure of sparsity: with ||(9||« = Y.k l^fcP' define @q{C) = {9 & Sp-'^ : \\9\\g < 
C}. Paul proposes a two-step procedure for selecting a reduced subset of variables 
on which to perform PCA, resulting in an estimator 9^ for which 

sup EL{9^, 9) < K{C) logp • m^r^. (23) 
flee,(C) 

Here m„ is an eflFective dimension parameter, equal to (C-^/(t^ log p))'^/^ in the 
"sparse" case when this is smaller than c\p, and equal to p in the contrary "dense" 
case. Lower bounds are obtained that show that this estimation error is optimal, 
in a minimax sense, up to factors that are at most logarithmic in p. 

Bounds such as (23) are reminiscent of those for estimation of sparse mean 
sequences in white Gaussian noise [12, 32, 59]. An observation due to Paul provides 
a link between eigenvector estimation and the estimation of means. Again with 
M = 1 for simplicity, let 9 be the ordinary PCA estimate of 9. Write C = {9, 9) 
and 9-^=9- C9. Then, with 5^ = 1 - (7^, in the decomposition 

9 = C9 + SU, U = 9^/\\9^\\ 

it happens that U is uniformly distributed on a copy of Sp~^, independently of S. 

It is a classical remark that a high-dimensional isotropic Gaussian vector is 
essentially concentrated uniformly on a sphere. We may reverse this remark by 
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starting with a uniform distribution on a sphere, and introducing an ultimately 
inconsequential randomization with ~ Xp-i/P ^tnd zi ~ iV(0, 1/p) with the 
result that z = RU + Zi9 has an Np{Q,I) distribution. This leads to a signal-in- 
Gaussian-noise representation 

Y = Ce + T^z, = l/{2nh{X)), 

Work is in progress to use this connection to improve the extant estimation results 
for eigenvectors. 



6. Coda 

One may expect a continuing fruitful influence of developments in random matrix 
theory on high dimensional statistical theory, and perhaps even some flow of ideas 
in the opposite direction. A snapshot of current trends may be obtained from 
http : //www . samsi . inf o/workshops/2006rarmiat-opening200609 . shtml, being 
the presentations from the Opening Workshop of a semester devoted to High 
Dimensional Inference and Random Matrices at the NSF Statistics and Applied 
Mathematics Institute in Fall 2006. 
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